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Abstract 

FPGA technology can offer significantly higher performance at 
much lower power consumption than is available from CPUs and GPUs 
in many computational problems. Unfortunately, programming for 
FPGA (using hardware description languages, HDL) is a difficult and 
not-trivial task and is not intuitive for C/C++/Java programmers. 
To bring the gap between programming effectiveness and difficulty the 
High Level Synthesis (HLS) approach is promoting by main FPGA 
vendors. Nowadays, time-intensive calculations are mainly performed 
on GPU/CPU architectures, but can also be successfully performed us¬ 
ing HLS approach. In the paper we implement a bandwidth selection 
algorithm for kernel density estimation (KDE) using HLS and show 
techniques which were used to optimize the final FPGA implementa¬ 
tion. We are also going to show that FPGA speedups, comparing to 
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highly optimized CPU and GPU implementations, are quite substan¬ 
tial. Moreover, power consumption for FPGA devices is usually much 
less than typical power consumption of the present CPUs and GPUs. 

Keywords: FPGA, High Level Synthesis, Kernel Density Estima¬ 
tion, Bandwidth Selection, Plug-in Selector 


1 Introduction 

The probability density function (PDF) is a key concept in statistics with 
many practical applications, see for example EL pL8j . Constructing the most 
adequate PDF from the observed data is still an important and interesting 
research problem, especially for large datasets. PDFs are often calculated 
using nonparametric data-driven methods. One of the most popular non- 
parametric method is the kernel density estimation (KDE) |23) . However, a 
very serious drawback of using KDE is the large number of calculations re¬ 
quired to compute them as well as to find the optimal bandwidth (smoothing) 
parameter (time complexity 0(n 2 )). 

In this paper we investigate the possibility of utilizing field-programmable 
gate arrays (FPGA) to accelerate finding of such the optimal bandwidth. 
Towards the needs of the paper we have selected one popular and often used 
algorithm called in literature the PLUGIN [23]- This work can be considered 
as a continuation and extension of the paper |Tj, where the authors utilize 
GPUs for speeding up optimal bandwidth selection. One of the algorithm 
analyzed in that paper was the above mentioned PLUGIN one. 

Generally, there are two methodologies for speeding up complex numerical 
algorithms: software-based and hardware-based. In this paper we concen¬ 
trate only on hardware-based methods. The commonly known approaches 
are as follows: (a) computing on general purpose multicore CPU micropro¬ 
cessors, (b) computing on distributed environments (e.g. clusers, grids, etc.), 
(c) computing on GPUs, [2D] (d) computing on DSP units and (e) computing 
on FPGA chips [12] US] US 22, 123]. 

In the paper we are concerned with FPGA approach. In [9] the author 
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considers a problem how to use FPGA for fast computing of PDFs using direct 
VHDL programming approach. However, the problem we are concerning is 
of different nature as we concentrate our attention for computing the optimal 
bandwidth for PDF (see Chapter [21). 

To develop the final FPGA design we use the High Level Synthesis (HLS) 
approach 0 . ra. where no direct hardware description language (HDL) 
coding is needed (typically in VHDL or Verilog language 

The remainder of the paper is organized as follows. In section [2] we turn 
our attention to give the reader some preliminary information on KDE and 
bandwidth selection. In section [3] we give detailed mathematical formulas 
for calculating optimal bandwidth using the PLUGIN method. In section [3] 
we cover all the necessary details on our FPGA-based implementation. We 
also present practical experiments we carried out and discuss the results. In 
section O we conclude the paper. 

2 Kernel Density Estimation and Bandwidth 
Selection 

Let us consider a continuous univariate random variable X and let assume 
its probability density function / exists but is unknown. Its estimate, usually 
denoted by /, will be determined on the basis of a random sample of size n, 
that is Xi, X 2 ,..., X n (our experimental data). In such a case, a 1-dimensional 
kernel density estimator f(x, h ) of a real density f(x) for random sample 
X \, X 2 , ■ ■ ■, X n is given by the following formula 



( 1 ) 


Ut is worth to note that OpenCL framework, which is commonly used by GPU program¬ 
mers, becomes also available for FPGA devices. Nowadays, OpenCL is offered by Altera 
SDK for OpenCL to easily implement OpenCL applications for FPGA. Recently, Xilinx 
announced a similar solution, namely SDAccel Development Environment for OpenCL, C, 
and C++. 
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Figure 1: An example of using kernel density estimators for determining the 
probability density function. 


h is a positive real number called smoothing parameter or bandwidth and 
K(-) is the kernel function - a symmetric function that integrates to one. In 
practical applications K(-) has often the Gaussian normal form, that is 

K(u)= vfe exp (4“ 2 )- (2) 

If we have the bandwidth h, we can determine the estimator f(x, h) of the 
unknown density function f(x) using formula (JT]). The bandwidth h is the 
parameter which exhibits a strong influence on the resulting KDE. 

Formula (JT]) can be easily extended to the multivariate case. In the most 
general variant the scalar bandwidth h is replaced by the unconstrained band¬ 
width matrix H (which is symmetric and positive definite). Also (J2J) is gener¬ 
alized to the multivariate case. Two commonly used kernel types are product 
and radial (also known as spherically symmetric ) ones. 

As an example of how KDE works consider a toy dataset of 8 data points: 
X t = {0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5}. Three different KDEs based on these 
data are depicted in Figure [0 It is easy to notice how the bandwidth h 
influences the shape of the KDE curve. Lines in bold show the estimated 
PDFs, while normal lines show the shapes of individual kernel functions K{x) 
(Gaussians). Dots represent the data points X t . 

Choosing the best value of h is not a trivial task and this problem was and 
still is extensively studied in literature 0 . 0 - Currently available selectors 
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can be roughly divided into 3 classes [TB], [23]. 

The first class uses very simple and easy to calculate mathematical for¬ 
mulas. They were developed to cover a wide range of situations, but do 
not guarantee being enough close to the optimal bandwidth. They are often 
called rules-of-thumb methods. 

The second class contains methods based on least squares and cross- 
validation ideas with more precise mathematical arguments, but they re¬ 
quire much more computational effort. However, in reward for it, we get 
bandwidths more accurate for a wider range of density functions. 

The third class contains methods based on plugging in estimates of some 
unknown quantities that appear in formulas for the asymptotically optimal 
bandwidth. 

One selected method from the third class is investigated in the paper. The 
method is briefly presented in Chapter [3] and from now on it will abbreviated 
as the PLUGIN. 


3 The PLUGIN Method and Data Preprocess¬ 
ing 

In Algorithm |T] we give recipe for calculation of the optimal bandwidth using 
the PLUGIN method (the symbols used are exactly such as in the book 
[23]). All the necessary details on the method, as well as details on deriving 
of particular mathematical formulas can be found in many source materials, 
see for example books [TD1IHJ |W, S3]. 

It is important to stress that the PLUGIN algorithm is a strictly sequen¬ 
tial computational process (see Figure [2] parallel processing is possible only 
internally in Steps IV and VI) as every step depends on the results obtained 
in the previous steps. First we calculate the variance and the standard de¬ 
viation estimators of the input data, see Step I in Algorithm |T] Then we 
calculate some more complex formulas from Step II to Step VI. Finally, we 
can substitute them into equation given in Step VII to get the searched 
optimal bandwidth value h. 
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Data: data set X, contains n elements 

Result: value h represents the optimal bandwidth for kernel density estimation 
Step I: Calculate the estimates of variance (V) and standard deviation (a): 


1 1 
V i - V x 2 - ——— . / 

n— 1 ^ n(n — 1) \ z —' 

i— 1 v 7 \ 


•n \ 

'E Xj j , a vE. 


Step II: Calculate the estimate of functional tl> 8 : 


V" s <- 


105 


32 V / 7T(7 9 ' 


Step III: Calculate the bandwidth of the kernel estimator of function f^ (fth 
derivative of function f, that is f h) = ): 

\/a 2 (K)<it^ s n J V2tt 

Step IV: Calculate the estimate 4 / 6 ( 31 ) of functional 


^(ffi) < - 

n 2 g{ 

1 


E£* (6> 

*= 1 3=1 


X, - Xj 

9i 


K 6 (x) = —= (x 6 - 15a; 4 + 45a; 2 - 15) e~* x . 

V2tt 

Step V: Calculate the bandwidth of the kernel estimator of function /^: 

1/7 


9 2 


/ -2X 4 (0) 


, X 4 (0) = ^2(X) = 1 


\Li 2 (K)i> & (gi)n) ’ 

Step VI: Calculate the estimate 414 ( 32 ) of functional 4U: 


4 , 4(ff2) 


n 2 3| 


EE* (4) 

i=ij=i 


X, - X,- 

92 


K\x) = -E= (x 4 - 6z 2 + 3) e"^ 2 . 

Step VII: Calculate the final value of the bandwidth h: 

\ 1/5 


h «- 


J?(X) 


/i2(X) 2 4' 4 (3 2 )« . 


, R(X) = E M2 (X) = 1 


Algorithm 1: Main computational steps of the PLUGIN algorithm 
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Figure 2: Flowchart of the PLUGIN algorithm with optional data prepro¬ 
cessing (z-score standardization) 
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Our implementation of the Algorithm [T] is carried out in fixed-point arith¬ 
metic (see section m- Unfortunately, using the raw data while conducting 
the required calculations, threatens a potential problems with overflow, es¬ 
pecially while calculating the value of see Step II in Algorithm [TJ Note 
that the estimate of standard deviation in is raised to the power of 9. 
For large values of a it results in extremely small values of The above 
problems can be successfully overcome if the input datasets are standardized 
using the z-score formula, that is 


Z< = 


Xi h 


cr 


(3) 


where ft and a are mean and standard deviation of the original vector X 
respectively. Z-score guarantees that a — 1 in and, consequently, 
entity has simply a constant value. 

Applying the data standardization requires an extra operation on the h 
value in Step VII in Algorithm [U that is 


^filial h • (7. 


(4) 


where h is the bandwidth calculated for the standardized dataset and a is 
the standard deviation of the original vector X. The correctness of the above 
equation can be easily proofed algebraically. 

To reduce the calculation burden we can also slightly change equations 
^(pi) and \Ein Algorithm CD It is easy to notice a symmetry, that is 

Xi - XA 

So, the double summations can be changed and, consequently, the final 
formula for ^(pi) has now the following form 


A d6) 


Xi 


Xi 


9 1 


(5) 



^eigi) t - 

n 2 g{ 


n n 


2 ( ]T K {6) ( Xi Xj ) ] +nK^( 0) 

2—1 j=l,i<j ^ 


( 6 ) 


(note for different summation ranges, the „2” before sums and an extra factor 









added, that is nK^\ 0)). Obviously, the same concerns it'd) and $ 4 (g 2 ) 



Computation complexity of Steps IV and VI (double summations), where 
the symmetry property is used, still belongs to 0(n 2 ) complexity class 


n 



i =1 j=* 


where = T) + T 2 + T 3 , and T) represents computation time for the dif¬ 
ferences, X2 represents division time, and X3 represents time for computing 
iO 6 ) and K^ 4> polynomials. 

4 FPGA-based Implementation 

4.1 Xilinx’s High Level Synthesis 

High Level Synthesis (HLS) is an automated design process that interprets an 
algorithmic description of a problem (given in high level languages 0/0+4*) 
and translates this problem into a so called register-transfer level (RTL) HDL 
code. Then in turn this HDL code can be easily synthesized to the gate level 
by the use of a logic synthesis tool, like for example Xilinx ISE Design Suite, 
Xilinx Vivado Design Suite, Altera Quartus II. 

In this paper we discuss results obtained using a tool called Xilinx Vivado 
High Level Synthesis, a feature of Vivado Design Suite. This tool supports 
C/C++ inputs, and generates VHDL/Verilog/SystemC outputs. Other so¬ 
lutions are offered by Scala programming language f2] and a specialised high 
level synthesis language called Cx m It should also be mentioned that a 
similar tool called A++ is also available for Altera FPGA devices. 
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4.2 Implementation Preliminaries 

Before implementing the PLUGIN Algorithm Q] it is important to take some 
assumptions affecting both performance and resource consumption. 

The first assumption is about a proper arithmetic used. The floating¬ 
point one gives very good range and precision. Unfortunately, from FPGA’s 
point of view, this representation is very resource demanding. In contrast, 
the fixed-point arithmetic is much less resource demanding but its range and 
precision are more limited. 

Hence, the exact fixed point representation was determined based on a 
careful analysis of the particular intermediate values taken during calcula¬ 
tions. If the input dataset doesn’t contain extremely large outliers (which 
suggests that such dataset should be first carefully analyzed before any sta¬ 
tistical analysis taken) and if the z-score standardization is used, Q 32.32 
fixed point representation is sufficient for all calculations (that is: integer 
part length m = 31, fractional part length n = 32, word length N = 64 and 
the first bit represents the sign). Note also that as a result of the z-score 
standardization, the vales of V, a, are constant and this significantly 
simplifies the calculations. The fractional part does give the required preci¬ 
sion. However, the integer part must also be sufficiently large, as n 2 factors 
are present in the PLUGIN algorithm. 

The second assumption is about choosing the most adequate methods for 
calculating individual steps in Algorithm [TJ Now it needs to be stressed that 
programming for FPGA devices differs considerably from programming for 
CPUs/GPUs devices. FPGA devices are built from a large number of simple 
logical blocks like: Look Up Tables (LUT), Flip-Flops (FF), Block RAMs 
(BRAM), specialized DSP units (DSP). These blocks can be connected each 
other and can implement only relatively low-level logical functions (the so 
called gates level). As a consequence, even very basic operations, like for ex¬ 
amples the adder for adding two numbers must be implemented from scratch. 
In description of the PLUGIN Algorithm [T] one can easily indicate such op¬ 
erators like (a) addition, (b) subtraction, (c) multiplication, (d) division, (e) 
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reciprocal, (f) exponent, (g) logarithm^, (h) power, (i) square roots, (j) higher 
order roots. 

Our implementation utilizes the following methods: CORDIC [5j,|6j for 
calculating exponents and logarithms, divisions were replaced by multipli¬ 
cations and reciprocals, difference operators were replaced by addition of 
negative operands. Additionally, one extra implementation of the exponent 
function was used for calculations of K '^ and j n Algorithm [1] This 
implementation is based on the Remez algorithm 0. ra, m and is open to 
pipelining. As a consequence, a significant speedup can be achieved during 
calculations of Steps IV and VI in Algorithm [D 

It is also worth to note that the authors’ implementation of the division 
operator (base on multiplications and reciprocals; the reciprocal is based on 
the Newton method) is significantly faster than the default division operator 
available in Vivado HLS. Moreover, the another advantage of using our own 
operators, is that no IPCore (Xilinx’s library of many specialized functions 
available for FPGA projects) is needed. As a consequence, the generated 
VHDL codes are more portable for FPGA chips from different than Xilinx 
vendors. 

The third assumption during implementing of the PLUGIN algorithm was 
to enable the nominal clock frequency of an FPGA chip used (see chapter 
14.41 for details). During experiments it was turned out that the usage of the 
original division operator resulted in problems with reaching the required 
frequency. The authors’ original implementation of the division operator 
(base on multiplications and reciprocals) solved this problem. 

The forth assumption was that all the input datasets must be stored in the 
BRAM memory, which are available in almost all current FPGA chips. They 
have enough capacity to store truly large data, like even 500,000 elements or 
more. 

2 Logarithm is not directly present in the PLUGIN mathematical formulas, but it is 
used while implementing higher order roots from the following definition x v = exp(ylnir). 
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Figure 3: General overview of the main units for the FPGA-based PLUGIN 
algorithm implementation 

4.3 Implementation Details 

In Figure [3] we show the scheme of the PLUGIN implementation where all 
the main components are presented. They correspond literally to the seven 
steps shown in Algorithm Q] 

Figure [4] presents general architecture of the functional unit for comput¬ 
ing 44 (^ 2 ) (Step VI in Algorithm [T]) . It is worth to note that the proper 
architecture of this unit must be reached during careful coding in Vivado 
HLS, using techniques like listed in section [4721 

We developed three different versions of the PLUGIN algorithm. 

The first implementation , called literal , is just a literal rewriting of Al¬ 
gorithm Q] (with the improvements (J6j) and (17jl). No additional actions were 
taken toward optimization of both execution time and resource requirements. 
This version can operate with any unsealed input data (assuming that all the 
inputs as well as all the internal results fulfil the fixed-point ranges that have 
been set). This version automatically (Vivado decides) utilizes pipelining. 
However, the pipelining doesn’t make the implementation enough fast and 
additionally, large number of DSP blocks is used. FFs and LUTs usage is 
also quite big (see Table [T]). 

The second implementation , called minimal, is written so that it is opti¬ 
mized for resource utilization, mainly the DSP units. To reduce the number 
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Figure 4: General architecture of the ^ 4 (^ 2 ) unit at the block-level view. 
The extra frame called ul-part shows the part of the Step VI in Algorithm [T] 
where loop unrolling can be used 

of the DSP units some dedicated functions for addition and multiplication 
are required. Using Vivado HLS compiler’s pragmas ( #pragma HLS IN¬ 
LINE off) pipelining can be disabled (on default, during translation of the 
high level codes into HDL ones pipelining is enabled whenever it is possi¬ 
ble). As can be observed in Tabled] a significant reduction of the DSP units 
was achieved. It confirms the fact that Vivado HLS is very sensitive for the 
structure of the high level codes being translated into HDL ones. So that, 
to achieve good performance and resource usage many modifications of the 
high level codes are required. 

The third implementation , called fast, is written so that it is optimized 
for time execution. Addition and multiplication functions were implemented 
in two ways. In the first way (similar as in minimal implementation) the 
pipelining is disabled, while in the second way it is enabled. The pipelined 
versions of the functions are used in Steps IV and VI in Algorithm [I] as these 
two steps are crucial for the final performance. Additionally, in these two 
steps a dedicated implementation of the exponent function was used (based 
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on Remez algorithm which is more likely to pipelining). Also, a technique 
known as loop unrolling was used in a manual manner (see sample codes in 
Figure [6]). Although Vivado HLS uses automatic loop unrolling, this feature 
doesn’t work correctly in our algorithm (as it can operate with datasets of 
any size and the exact number of loops is not known in advance). 

The forth and fifth implementations used during experiments (called CPU 
and GPU respectively) are the ones implemented and investigated in |Tj. 
CPU implementation utilizes the SSE (Streaming SIMD Extensions) of the 
current multicore CPUs. 

4.4 Results 

During all practical experiments the target Xilinx Virtex-7 xc7vx690tffg 1761- 
2 device was used. Its nominal working frequency is 200MHz (or 5 ns for 
a single clock tact). CPU implementation was run on Intel Processor i7 
4790k 4-0 GHz. Geforce 480GTX graphics card was used for GPU imple¬ 
mentation. Vivado HLS ver. 2015.2 was used for developing all the FPGA 
implementations. 

The summary of the resource consumption is given in Table [Q Addi¬ 
tionally, power consumption is included. It is a real power (in Watts) taken 
by the FPGA chip after physical implementation of the PLUGIN algorithm 
using Vivado Design Suite. The power consumption of the FPGA implemen¬ 
tations is significantly smaller comparing with the power consumption of the 
CPU and GPU implementations. The power consumption for the CPU and 
GPU used in our experiments are an average (catalogue-like) values. 

The summary of the execution times for three different implementations 
of the PLUGIN algorithm, as well as CPU and GPU ones is given in Table [2j 
The minimal and the fast implementations were run on 200MHz nominal 
clock while the literal implementation was run with 166 MHz nominal clock. 
This frequency degradation was caused mainly because of some limitations 
of the original division operator implemented in Vivado HLS. 

Of course the best performance was achieved for the fast implementation 
(even compared to the CPU and to the GPU implementations). This is the 
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Table 1: Resources usage for three different FPGA implementations of the 
PLUGIN algorithms as well as CPU and GPU implementations. Additionally 
power consumption is included. 

Method BRAM 18k DSP FF LUT Watts 

literal 128 U64 80753 81995 3.938 

minimal 128 240 15889 22895 1.153 

fait 128 1880 85775 38050 6.963 

CPU - - - « 88 

GPU - - - « 250 


Table 2: Execution times (in sec.) for three different FPGA implementations 
of the PLUGIN algorithm and for CPU and GPU implementations 
n literal minimal fast CPU GPU 

128 0.0555 0.0324 0.000276 0.0210 0.00699 

256 0.2266 0.1363 0.000560 0.0252 0.00788 

384 0.5155 0.3152 0.000889 0.0322 0.00947 

512 0.9112 0.5513 0.001257 0.0346 0.00962 

640 1.4466 0.8968 0.001667 0.0361 0.01063 

768 2.1023 1.3205 0.002114 0.0375 0.01172 

896 2.8771 1.8232 0.002606 0.0405 0.01447 

1024 3.7666 2.3926 0.003140 0.0427 0.01641 


result of combination of the following three optimization techniques used: 
(a) implementation of some dedicated arithmetic operators, (b) a proper 
exponential function approximation and (c) the for loops unrolling. 

A very significant speedup was achieved comparing the fast and the lit¬ 
eral implementation (average speedup about 760, see Table |3|). The fast 
implementation is faster then the CPU implementation (average speedup 
about 32, see Table [3]). The fast implementation is also faster then the GPU 
implementation (average speedup about 10, see Table |3|). 

The summary of the accuracy for three different implementations of the 
PLUGIN algorithm is given in Table SJ h re f is the reference bandwidth calcu¬ 
lated in double floating point arithmetic (in C+T program, 15-17 significant 
decimal digits). It is worth to note that the relative errors for literal , min¬ 
imal and fast implementations are very small (not more than 0.004%). In 
practical applications such small values can be in fact neglected. 
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Table 3: Speedups for three different FPGA implementations of the PLUGIN 
algorithm and for CPU and GPU implementations 


n 

literal / fast 

minimal / fast 

CPU/fast 

GPU/fast 

128 

201 

118 

76 

25 

256 

404 

243 

45 

14 

384 

580 

354 

36 

11 

512 

725 

439 

28 

8 

640 

868 

538 

22 

6 

768 

994 

625 

18 

6 

896 

1104 

700 

16 

6 

1024 

1200 

762 

14 

5 


Performance of the PLUGIN algorithm implementations 



Figure 5: Performance and scalability of different PLUGIN algorithm imple¬ 
mentations (for better readibility log scale for Y axis is used) 


The summary of the scalability of different PLUGIN algorithm implemen¬ 
tations is presented in Figure 0 Scalability of the FPGA implementations is 
nearly linear, which is a very welcome behavior. The corresponding results 
for CPU and GPU implementations can be found in pQ. The figure is in fact 
a graphical summary of data given in Table [21 

Simplified source codes of the three FPGA implementations are presented 


16 











































































































Table 4: Accuracy (relative error) for three different FPGA implementations 
of the PLUGIN algorithms. h re f was calculated in C+— direct implemen¬ 
tation of Algorithm Q] in floating point double arithmetic (15-17 significant 
decimal digits). |4| = * 100% where h method is h Xiterah h minimal 

or hfast _ 


n 

hliteral 

href 

|4| (%) 

128 

0.304902711650357 

0.304902701728222 

3.25e-06 

256 

0.227651247521862 

0.227651285449348 

1.67e-05 

384 

0.202433198224753 

0.202433187549741 

5.27e-06 

512 

0.242707096505910 

0.242707026022425 

2.9e-05 

640 

0.190442902734503 

0.190443702342891 

0.00042 

768 

0.175199386896566 

0.175199406819444 

1.14e-05 

896 

0.172251554206014 

0.172251524317464 

1.74e-05 

1024 

0.174044180661440 

0.174044236921001 

3.23e-05 

n 

hminimal 

href 

|4| (%) 

128 

0.304902980336919 

0.304902701728222 

9.14e-05 

256 

0.227651586290449 

0.227651285449348 

0.000132 

384 

0.202433346537873 

0.202433187549741 

7.85e-05 

512 

0.242707266006619 

0.242707026022425 

9.89e-05 

640 

0.190443017752841 

0.190443702342891 

0.000359 

768 

0.175199396442622 

0.175199406819444 

5.92e-06 

896 

0.172251742798835 

0.172251524317464 

0.000127 

1024 

0.174044403014705 

0.174044236921001 

9.54e-05 

n 

hfast 

href 

141 (%) 

128 

0.304901758907363 

0.304902701728222 

0.000309 

256 

0.227651913650334 

0.227651285449348 

0.000276 

384 

0.202433891594410 

0.202433187549741 

0.000348 

512 

0.242707268567756 

0.242707026022425 

9.99e-05 

640 

0.190443484811112 

0.190443702342891 

0.000114 

768 

0.175199736841023 

0.175199406819444 

0.000188 

896 

0.172251721611246 

0.172251524317464 

0.000115 

1024 

0.174044031649828 

0.174044236921001 

0.000118 


in Figure [HI Complete source codes (CAT and resulted Vivado HLS transla¬ 
tions into VHDL) are available on request. The first version is just the literal 
implementation of the step VI in Algorithm Q] in C language. Unfortunately, 
as can be observed in Table [2] and in Figure [5] such implementation is very 
slow. In the second version multiplications and additions are realized using 
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dedicated functions ( fADD , fMUL). Also a dedicated function for recipro¬ 
cal operator was implemented. In the third version much more modification 
was implemented. First, loop unrolling was used, second, Vivado HLS prag¬ 
mas were used and third, multiplications and additions were realized using 
dedicated functions with pipelining enabled ( pfADD , pfMUL). 
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// literal implementation 
psi4_fl: for( i=0; i<N; i++ ) { 

psi4_f2: for( j=i+l; j<N; j++ ) { 

s = s + k4( ( ( x[i] - x[j] ) / g2) ); 

> 

> 

// minimal implementation 
rg2 = reciprocal( g2 ); 
psi4_fl: for( i=0; i<N; i++ ) { 

psi4_f2: for( j=i+l; j<N; j++ ) { 

s = fADD( s, k4( fMUL( fADD( x[i], —x [ j] ), rg2 ) ) ); 

> 

> 

// fast implementation 
rg2 = reciprocal( g2 ); 
psi4_fl: for( i=0; i<N; i++ ) { 

psi4_f2: for( j=i+l; j<N; j+=2 ) { 

#pragma HLS EXPRESSION_BALANCE 
#pragma HLS PIPELINE 


if( j == i+1 ) tmp = 0.0; 

if( j<N ) { tmpl = 0.0; tmp2 = 0.0; } 

psi4_fl_b0: { 

tmpla = pfADD( x[i], —x [ j] ); 
tmpva = pfMUL( tmpla, rg2 ); 
tmpl = k4( tmpva ); 

> 

psi4_fl_bl: { 

if( (j+1) < N ) { 

tmplb = pfADD( x[i], -x[j+l] ); 
tmpvb = pfMUL( tmplb, rg2 ); 
tmp2 = k4( tmpvb ); 

> 

> 

if( j<N ) { tmp = pfADD( tmp, tmpl ); tmp = pfADD( tmp, tmp2 ); } 
if( j+2>=N ) s = pfADD (s, tmp ); 

> 

> 


Figure 6: Three fundamental methods of the for loop implementation used 
in T 4 (^ 2 ) calculation (step VI in Algorithm [TJ step IV is implemnented in 
the same way). In the fast implementation the loop unrolling is used twice. 
fADD, fMUL functions don’t utilize pipelining, while pfADD i pfMUL func¬ 
tions do it 
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5 Conclusions 


HLS tools are competitive with manual design techniques using HDLs. Imple¬ 
mentation time of complex numerical algorithms can be essentially reduced 
(comparing to direct coding in HDL languages). 

Unfortunately, to obtain efficient FPGA implementations, many changes 
to source codes are required, comparing to equivalent implementations for 
CPUs and/or GPUs. This is because FPGA devices use specific primitives 
like DSP, BRAM, FF and LUT blocks and programmers should control their 
utilization manually. However, this control is performed on the level of 
C/C+Hb codes, not the HDL ones. It is also worth to stress that using 
the HLS approach allows to obtain implementations which are often faster 
than CPU and/or GPU counterparts. 

Another crucial motivation for replacing GPU or CPU solutions by their 
FPGA equivalents is power consumption. FPGA can settle for single Watts, 
while CPU or GPU counterparts typically take tens/hundreds of Watts or 
even more. 

Another possible step toward fast implementations of numerical algo¬ 
rithms could be considering of using modern DSP chips which offer many 
interesting possibilities. 
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